Sending and Receiving Files¶
Not all calculations can return their result via the remotemanager syntax. For this reason, Dataset
also allows you to work with files, providing the extra_files_send
and extra_files_recv
hooks.
These perform as you would expect: a dataset which has some extra files to send will attempt to grab those files and send them with each run. Likewise, if extra files are specified to be received, a fetch_results()
call will also attempt to fetch those files.
Let's start with a function which merges two files, to demonstrate this:
[2]:
from remotemanager import Dataset

def merge(fpath_a, fpath_b, fpath_c):
    with open(fpath_a, 'r') as o:
        data_a = o.read()
    with open(fpath_b, 'r') as o:
        data_b = o.read()
    with open(fpath_c, 'w+') as o:
        o.write(data_a + '\n' + data_b)

merge_files = Dataset(merge, skip=False)
Now that we have our function, we need to create some files to send and merge:
[3]:
with open('temp_file_a.txt', 'w+') as o:
    o.write('hello, world!')

with open('temp_file_b.txt', 'w+') as o:
    o.write('add me to the output!')
[4]:
args = {'fpath_a': 'temp_file_a.txt',
        'fpath_b': 'temp_file_b.txt',
        'fpath_c': 'output.txt'}

merge_files.append_run(args=args,
                       extra_files_send=['temp_file_a.txt', 'temp_file_b.txt'],
                       extra_files_recv=['output.txt'])
appended run runner-0
Now run and collect our results:
[5]:
merge_files.run()
Staging Dataset... Staged 1/1 Runners
Transferring for 1/1 Runners
Transferring 7 Files in 2 Transfers... Done
Remotely executing 1/1 Runners
[5]:
True
[6]:
merge_files.wait(1, timeout=10)
merge_files.fetch_results()
Fetching results
Transferring 3 Files... Done
Let's see what's in the results, and whether the file has been returned as expected:
[7]:
print(merge_files.results)
[None]
[8]:
with open(f'{merge_files.local_dir}/output.txt', 'r') as o:
print(o.read())
hello, world!
add me to the output!
Looks like it worked. Since the function itself does not return anything, we see None
in the results.
File Paths¶
When using this feature, it’s important to pay attention to the locations of your files, as it’s easy to get confused.
extra_files_send
bases its locations on the current working directory from which the dataset is run. When the run()
command was issued for this example, Dataset
will have looked within os.getcwd()
for the files temp_file_a.txt
and temp_file_b.txt
. In short, it operates between pwd
and the remote dir.
extra_files_recv
is slightly different, operating between the local_dir
and the remote directory. This can be seen in the example above: output.txt
is dropped into the local_dir
rather than the directory where the input files were sourced.
Fine Control¶
Added in version 0.12.3.
If the standard behaviour of files being transferred between the working dir and the remote dir isn't to your liking, there are other options. While slightly more complex in terms of syntax, these give you fine control over your file locations.
Dict control¶
One way to do this is to specify your listings as dictionaries. This takes the form:
[9]:
extra_files_send = [{"local/path/to/file.txt": "path/to/target"}]
Note
It is assumed that the file name will be identical on the remote and local sides. If you need to change the name, you should do so within your Function.
In this case, it tells remotemanager
that the extra file file.txt
can be found in the local directory local/path/to/
, and that we want it to be sent to a directory named path/to/target
relative to the dataset remote_dir
.
Paths¶
Note that the target path is relative to the Dataset.remote_dir
property, unless you specify an absolute path.
If we assume that we have a Dataset with the remote_dir
set to remote_run
, then we can send file.txt
to remote_run/inner_dir
using:
[10]:
extra_files_send = [{"file.txt": "inner_dir"}]
However:
[11]:
extra_files_send = [{"file.txt": "/home/user/run_data"}]
Would send file.txt
to /home/user/run_data
, since the target is an absolute path.
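The resolution rules for these dict specs can be summarised in a small helper. This is a sketch of the documented behaviour, not remotemanager's actual implementation:

```python
import os

def resolve_remote(spec: dict, remote_dir: str) -> str:
    """Map a {local_path: remote_target} spec to a remote file path.

    Sketch only: the file keeps its name, and the target directory is
    taken relative to remote_dir unless it is absolute. "" and "." both
    mean remote_dir itself.
    """
    (local_path, target), = spec.items()
    fname = os.path.basename(local_path)
    if os.path.isabs(target):
        return os.path.join(target, fname)
    if target in ("", "."):
        return os.path.join(remote_dir, fname)
    return os.path.join(remote_dir, target, fname)

# relative target lands inside remote_dir
print(resolve_remote({"file.txt": "inner_dir"}, "remote_run"))
# absolute target bypasses remote_dir entirely
print(resolve_remote({"file.txt": "/home/user/run_data"}, "remote_run"))
```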
Demonstration¶
We can demonstrate this with a simple function that reads the contents of a file.
[12]:
def read(file):
    with open(file) as o:
        return o.read()

def create_file(fname):
    with open(fname, "w+") as o:
        o.write("foo")
[13]:
ds = Dataset(read, name="read", skip=False)
The following setup is the same as the standard behaviour. We can print the intended remote directory of the extra file by accessing its remote
property.
Note
Internally, your file specs are converted into a list of TrackedFile
objects, so all the methods available to these can be used here.
[14]:
create_file("tmp_standard.txt")
ds.append_run({"file": "tmp_standard.txt"}, extra_files_send=[{"tmp_standard.txt": ""}])
appended run runner-0
[15]:
print("Remote path for standard file:", ds.runners[0].extra_files_send[0].remote) # print the remote, for debugging
Remote path for standard file: temp_runner_remote/tmp_standard.txt
Note
To send to the remote_dir you can use the empty string ""
or the “current dir” shortcut "."
.
To send the file to a directory within the remote_dir, we can use this setup:
[16]:
create_file("tmp_dir.txt")
ds.append_run({"file": "inner_dir/tmp_dir.txt"}, extra_files_send=[{"tmp_dir.txt": "inner_dir"}])
appended run runner-1
[17]:
print("Remote path for inner_dir file:", ds.runners[1].extra_files_send[0].remote)
Remote path for inner_dir file: temp_runner_remote/inner_dir/tmp_dir.txt
Otherwise, we can send the file to any arbitrary directory, provided we know the absolute path.
Note
This will add a level of machine dependence to your run, as remotemanager expects that this path is valid and exists.
[18]:
import os

# create path using $HOME to allow testing
home = os.path.expandvars("$HOME")
path = os.path.join(home, "test")
file = os.path.join(path, "tmp_abs.txt")
[20]:
create_file("tmp_abs.txt")
ds.append_run({"file": file}, extra_files_send=[{"tmp_abs.txt": path}])
appended run runner-2
[21]:
print("Remote path for abspath file:", ds.runners[2].extra_files_send[0].remote)
Remote path for abspath file: /home/test/test/tmp_abs.txt
[22]:
ds.run()
ds.wait(1, 10)
Staging Dataset... Staged 3/3 Runners
Transferring for 3/3 Runners
Transferring 12 Files in 4 Transfers... Done
Remotely executing 3/3 Runners
If we collect the results we should see that all the files have been read in by the function.
[23]:
ds.fetch_results()
ds.results
Fetching results
Transferring 6 Files... Done
[23]:
['foo', 'foo', 'foo']
TrackedFile¶
Internally, all extra files are converted to TrackedFile
instances. This means you can also specify them directly, if you prefer. Let's add an extra runner which demonstrates this behaviour:
[24]:
from remotemanager.storage import TrackedFile

tfile = TrackedFile(".", ds.remote_dir, "trackedfile.txt")
# we can now use the write method of the TrackedFile class to add content to this file
tfile.write("foo, tracked")

ds.append_run({"file": tfile.name}, extra_files_send=[tfile])
appended run runner-3
Running this dataset again will run the new runner, showing the new file with its content:
[25]:
ds.run()
ds.wait(1, 10)
Staging Dataset... Staged 1/4 Runners
Transferring for 1/4 Runners
Transferring 6 Files in 2 Transfers... Done
Remotely executing 1/4 Runners
[26]:
ds.fetch_results()
ds.results
Fetching results
Transferring 2 Files... Done
[26]:
['foo', 'foo', 'foo', 'foo, tracked\n']
Important
The key point when setting up a TrackedFile
is the argument order: TrackedFile(local_dir, remote_dir, filename)
. This sets up a file-like entity that allows remotemanager to “track” the file between local_dir
and remote_dir
.
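The core idea of the constructor can be sketched with a minimal stand-in class. This is illustrative only: the real TrackedFile lives in remotemanager.storage and carries more functionality (such as the write() method used above), and the class name here is hypothetical:

```python
import os

class FileTracker:
    """Minimal sketch of the TrackedFile(local_dir, remote_dir, filename)
    idea: one object that knows both sides of a transfer."""

    def __init__(self, local_dir: str, remote_dir: str, filename: str):
        self.local_dir = local_dir
        self.remote_dir = remote_dir
        self.name = filename

    @property
    def local(self) -> str:
        # where the file lives (or will land) on the local machine
        return os.path.join(self.local_dir, self.name)

    @property
    def remote(self) -> str:
        # where the file lives (or will land) on the remote machine
        return os.path.join(self.remote_dir, self.name)

tfile = FileTracker(".", "temp_runner_remote", "trackedfile.txt")
print(tfile.local)   # ./trackedfile.txt
print(tfile.remote)  # temp_runner_remote/trackedfile.txt
```

Holding both paths in one object is what lets the library report the `remote` property for debugging, as seen in the earlier cells.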
Retrieving Files¶
Collecting files from your runs with this methodology follows the same syntax. Keep in mind that the value of the dictionary is the remote specification; the filename goes in the key.
[27]:
extra_files_recv = [{"local/path/to/file.txt": "remote_path"}]
This would fetch file.txt
from temp_runner_remote/remote_path/file.txt
and move it to local/path/to/file.txt
.
Note
Just like with sending, you can also use abspaths here.
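The recv rule mirrors the send rule, but with the key/value roles inverted: the key is the local destination and the value names the remote source directory. A sketch of that resolution, again illustrative rather than remotemanager's internal code:

```python
import os

def resolve_recv(spec: dict, remote_dir: str) -> tuple:
    """Sketch of the recv rule: the dict key is the *local* destination
    path, and the value names the *remote* directory to fetch from
    (relative to remote_dir, or absolute)."""
    (local_path, remote_sub), = spec.items()
    fname = os.path.basename(local_path)
    if os.path.isabs(remote_sub):
        remote_path = os.path.join(remote_sub, fname)
    else:
        remote_path = os.path.join(remote_dir, remote_sub, fname)
    return remote_path, local_path

src, dst = resolve_recv({"local/path/to/file.txt": "remote_path"},
                        "temp_runner_remote")
print(src)  # temp_runner_remote/remote_path/file.txt
print(dst)  # local/path/to/file.txt
```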